Jerome Kelleher, Kevin Thornton, Jaime Ashander, and Peter Ralph (me)
12 March 2018 :: bioRxiv
This talk:
slides at github: petrelharp/ftprime_ms/docs
For a set of sampled chromosomes, at each position along the genome there is a genealogical tree that says how they are related.
A tree sequence describes this, er, sequence of trees.
Observations:
The pedigree (parental relationships) plus crossover locations would give us the tree sequence for everyone, ever.
Much less can fully describe the history relevant to a sample of genomes.
This information is equivalent to the Ancestral Recombination Graph (ARG).
Kelleher, Etheridge, and McVean introduced the tree sequence data structure for a fast coalescent simulator, msprime.
stores genealogical and variation data very compactly
efficient algorithms available:
tree-based sequence storage closely related to haplotype-matching compression
at n = 10^7
Storing a tree sequence in the four tables - nodes, edges, sites, and mutations - is succinct (no redundancy).
These are stored efficiently on disk (HDF5), along with a bit more information (e.g., metadata).
Edges: who inherits from whom; only necessary for coalescent events.
Records: interval (left, right); parent node; child node.
Nodes: the ancestors those events happen in.
Records: time ago (of birth); ID (implicit).
Mutations: when state changes along the tree.
Records: site it occurred at; node it occurred in; derived state.
Sites: where mutations fall on the genome.
Records: genomic position; ancestral (root) state; ID (implicit).
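To make the four tables concrete, here is a minimal sketch in plain Python. It does not use the msprime file format or API: the class names, the toy data, and the helpers `parent_map` and `genotype` are all made up for illustration. It shows how the marginal tree at a position, and the state a sample carries at a site, can be read straight off the rows.

```python
from dataclasses import dataclass

@dataclass
class Node:
    time: float            # time ago of birth; the ID is the row index

@dataclass
class Edge:
    left: float            # child inherits from parent on [left, right)
    right: float
    parent: int
    child: int

@dataclass
class Site:
    position: float
    ancestral_state: str

@dataclass
class Mutation:
    site: int
    node: int              # the node the mutation occurred in
    derived_state: str

# Toy data (hypothetical): samples 0, 1, 2; ancestors 3, 4; genome [0, 10).
nodes = [Node(0.0), Node(0.0), Node(0.0), Node(1.0), Node(2.0)]
edges = [
    Edge(0, 5, 3, 0), Edge(0, 10, 3, 1), Edge(5, 10, 3, 2),
    Edge(0, 5, 4, 2), Edge(5, 10, 4, 0), Edge(0, 10, 4, 3),
]
sites = [Site(2.0, "A")]
mutations = [Mutation(site=0, node=3, derived_state="T")]

def parent_map(edges, position):
    """The marginal tree at `position`, as a child -> parent dictionary."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}

def genotype(sample, site_id):
    """Walk up the tree at the site; the first mutation met decides the state."""
    site = sites[site_id]
    parents = parent_map(edges, site.position)
    derived = {m.node: m.derived_state for m in mutations if m.site == site_id}
    node = sample
    while node is not None:
        if node in derived:
            return derived[node]
        node = parents.get(node)
    return site.ancestral_state
```

With this toy data the tree left of position 5 groups samples 0 and 1 under node 3, and the tree right of it groups 1 and 2, so one edge table encodes both trees without repeating the shared edges.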
Coalescent simulations are much faster than forwards-time, individual-based simulations
because they don’t have to keep track of everyone, only the ancestors of your sample.
But: selection, or sufficient geographic structure, breaks the assumptions of coalescent theory.
So, if you
then you have to do forwards-time, individual-based simulations.
To model linked selection, we need chromosome-scale simulations.
Then every individual needs to carry around her genotype (somehow). Even at neutral sites!
Bummer.
But wait…
If we record the tree sequence that relates everyone to everyone else,
after the simulation is over we can put neutral mutations down on the trees.
Since neutral mutations don’t affect demography,
this is equivalent to having kept track of them throughout.
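A rough sketch of "putting neutral mutations down on the trees" afterwards, in plain Python. This is not the implementation used in the talk: the function name `add_neutral_mutations`, the tuple layout of the edges, and the choice to model mutations as a Poisson process with a per-unit-length, per-unit-time `rate` are assumptions made for illustration.

```python
import random

def add_neutral_mutations(times, edges, rate, rng):
    """Scatter mutations over edges after the fact (illustrative sketch).

    times: list of node times (time ago), indexed by node ID.
    edges: list of (left, right, parent, child) tuples.
    rate:  assumed mutation rate per unit of genome per unit of time.
    Returns a sorted list of (position, node) pairs, where node is the
    child at the bottom of the branch the mutation landed on.
    """
    mutations = []
    if rate <= 0:
        return mutations
    for left, right, parent, child in edges:
        # Opportunity for mutation on this edge: genomic span x branch length.
        area = (right - left) * (times[parent] - times[child])
        # Count Poisson arrivals in [0, area) using exponential gaps,
        # and give each arrival a uniform position within the edge's span.
        used = rng.expovariate(rate)
        while used < area:
            position = left + rng.random() * (right - left)
            mutations.append((position, child))
            used += rng.expovariate(rate)
    mutations.sort()
    return mutations
```

Nothing here looks at who survived or reproduced, which is the point: the edges already record that, so neutral variation can be layered on once the simulation is done.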
This means recording the entire genetic history of everyone in the population, ever.
It is not clear this is a good idea.
Every time an individual is born, we must:
This produces waaaaay too much data.
We won’t end up needing the entire history of everyone ever,
but we won’t know what we’ll need until later.
How do we get rid of the extra stuff?
Question: given a tree sequence containing the history of many individuals, how do we simplify it to only the history of a subset?
Concretely, given an input tree sequence and a subset of its nodes we call the samples, we want a new tree sequence for which:
All marginal trees match the corresponding subtree in the input tree sequence.
Every non-sample node in marginal trees has at least two children.
All nodes and edges are ancestral to at least one sample.
No adjacent redundant edges (e.g., (ℓ, x, p, c) + (x, r, p, c) → (ℓ, r, p, c)).
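The last requirement is the easiest to show in code. A small sketch, assuming edges are plain `(left, right, parent, child)` tuples; the helper name `squash_edges` is made up here:

```python
def squash_edges(edges):
    """Merge edges with the same parent and child that abut on the genome."""
    merged = []
    # Sort so that candidates for merging end up next to each other.
    for left, right, parent, child in sorted(edges, key=lambda e: (e[2], e[3], e[0])):
        if merged:
            l0, r0, p0, c0 = merged[-1]
            if (p0, c0) == (parent, child) and r0 == left:
                # (l0, left, p, c) + (left, right, p, c) -> (l0, right, p, c)
                merged[-1] = (l0, right, parent, child)
                continue
        merged.append((left, right, parent, child))
    return merged
```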
Answer: to simplify a tree sequence to the history of the samples:
Paint each sampled chromosome a distinct color.
Moving back up the tree sequence, copy colors of each chromosome to the parental chromosomes they inherited from.
If two colors go in the same spot (coalescence), replace them with a new color (unique to that ancestor). Output a node for the ancestor and an edge for the coalescence.
Once all colors have coalesced in a given segment, stop propagating it.
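The painting recipe can be sketched in plain Python as follows. This is a simplified illustration, not the production algorithm: the function names, the tuple layout, and the brute-force handling of breakpoints are all assumptions, and it favors readability over speed. A "color" is an output node ID, and each segment also carries a count of the samples it represents so that fully coalesced segments can be dropped.

```python
def squash_edges(edges):
    """Merge abutting edges that share parent and child."""
    merged = []
    for left, right, parent, child in sorted(edges, key=lambda e: (e[2], e[3], e[0])):
        if merged and merged[-1][2:] == (parent, child) and merged[-1][1] == left:
            merged[-1] = (merged[-1][0], right, parent, child)
        else:
            merged.append((left, right, parent, child))
    return merged

def simplify(times, edges, samples, length):
    """times: node times; edges: (left, right, parent, child); genome is [0, length)."""
    out_times = [times[s] for s in samples]
    out_id = {s: i for i, s in enumerate(samples)}
    # Step 1: paint each sample its own color: (left, right, color, n_samples).
    ancestry = {s: [(0.0, length, i, 1)] for i, s in enumerate(samples)}
    out_edges = []
    children = {}
    for left, right, parent, child in edges:
        children.setdefault(parent, []).append((left, right, child))
    # Step 2: move back in time, youngest parents first.
    for p in sorted(children, key=lambda u: times[u]):
        queue = list(ancestry.get(p, []))  # p's own color, if p is a sample
        for left, right, child in children[p]:
            for a, b, color, k in ancestry.get(child, []):
                lo, hi = max(a, left), min(b, right)
                if lo < hi:  # only the part actually inherited from p
                    queue.append((lo, hi, color, k))
        breaks = sorted({x for seg in queue for x in seg[:2]})
        kept = []
        for lo, hi in zip(breaks, breaks[1:]):
            here = [(c, k) for a, b, c, k in queue if a <= lo and hi <= b]
            if not here:
                continue
            if len(here) == 1:
                color, k = here[0]  # no coalescence: pass the color through
            else:
                # Step 3: coalescence, so p gets an output node and edges.
                if p not in out_id:
                    out_id[p] = len(out_times)
                    out_times.append(times[p])
                color = out_id[p]
                k = sum(n for _, n in here)
                out_edges += [(lo, hi, color, c) for c, _ in here if c != color]
            # Step 4: stop propagating once every sample has coalesced here.
            if k < len(samples):
                kept.append((lo, hi, color, k))
        ancestry[p] = kept
    return out_times, squash_edges(out_edges)
```

Ancestors that only ever carry one color on a segment never get an output node, and anything not ancestral to a sample never receives a color at all, so it simply drops out.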